CS4132 Data Analytics
Scrabble is a game where players construct words out of letters. Each letter has a preset value based on its frequency, and these values determine how many points each play is worth. However, a higher tile value does not necessarily mean a more valuable tile in practice; it may even be the opposite. Hence, I would like to discover the true values of these tiles, along with other aspects of competitive Scrabble play, by analyzing datasets of games from Scrabble competitions since 2005, with tens of thousands of games to look at.
I have been an avid word game enthusiast since I was young, dabbling in various word games, including Scrabble. As a casual player, my plays were often limited by my vocabulary. The accepted Scrabble dictionaries contain many obscure words, like QADI, KEX and ETAERIO, to name a few. So I always wondered whether a limited vocabulary still affects players at the very highest level.
I also wondered which tiles were the best to fish for. This especially intrigued me because, while the Q tile has the highest value of 10, I frequently felt that drawing it from the bag was a burden on my rack: I would hold on to it for a long time, hurting my chances of bingoing, among other things. Hence, I wanted to know if Q was really worth keeping.
Scrabble is a turn-based word game, first developed in 1938, which people play, or watch others play, for entertainment. Players take turns constructing words to play. Each player has a rack of 7 tiles (unseen by opponents), each tile bearing a letter. The board on which tiles are placed is a 15x15 grid with various multiplier squares, including DOUBLE/TRIPLE LETTER (DL/TL) and DOUBLE/TRIPLE WORD (DW/TW).
The first player must play a word through the middle square, and every subsequent word must connect to an existing word on the board. Each letter tile has a preassigned value, e.g. 1 for S, 2 for D. The blank tile can be designated as any letter, but scores 0 points. A play is scored from the values of its letters and the multipliers the word is played through.
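To make the scoring rule concrete, here is a minimal sketch; `score_play`, `LETTER_VALUES` and the multiplier encoding are illustrative assumptions of mine, not part of the analysis below.

```python
# Hypothetical helper (not from this notebook): values for a few tiles.
LETTER_VALUES = {"Q": 10, "U": 1, "I": 1, "Z": 10, "S": 1}

def score_play(word, letter_mult=None, word_mult=1, bingo=False):
    """Sum letter values (with any DL/TL), apply DW/TW, add 50 for a bingo."""
    letter_mult = letter_mult or {}
    total = sum(LETTER_VALUES[ch] * letter_mult.get(i, 1)
                for i, ch in enumerate(word))
    return total * word_mult + (50 if bingo else 0)

# QI with Q on a double letter and the whole word on a triple word square:
print(score_play("QI", letter_mult={0: 2}, word_mult=3))  # (10*2 + 1) * 3 = 63
```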
In casual play, players mostly use common, everyday words. They usually play words perpendicular to existing words, and rarely weigh the points gained per play.
In competitive play, players usually have a large portion of the dictionary memorized, especially the 2-4 letter words and an arsenal of 7-letter words. They occasionally opt for more advanced plays like parallel plays, and deliberately place high-value letters or words on the relevant multipliers. These players commonly hunt for bingos, plays that use all 7 tiles on the rack, netting a 50-point bonus.
a) How accurate is the existing tile bag today?
Very accurate
b) What is the true value of tiles?
Blank and S are the best, followed by R, T then E. The worst is V, followed by W then U.
a) What factors does the frequency of words played in competitive play depend on?
The most significant contributing factor is the probability of drawing a given word's letters.
b) What is the average pattern of words played in games?
Most tiles cluster around the multipliers, except for the blank which is the converse. Tiles also tend to be played in the bottom left triangle rather than the upper right triangle.
All the games, scraped from https://www.cross-tables.com/annolistself.php
Word lists, from the official NASPA Zyzzyva app (https://scrabbleplayers.org/w/Zyzzyva)
Each tile's frequency and value, taken from Wikipedia's Scrabble letter distributions page
%pip install tqdm
Requirement already satisfied: tqdm in c:\users\user\anaconda3\lib\site-packages (4.64.0) Requirement already satisfied: colorama in c:\users\user\anaconda3\lib\site-packages (from tqdm) (0.4.4) Note: you may need to restart the kernel to use updated packages.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import random
from scipy import stats
from tqdm.auto import tqdm
import re
import joypy
from itertools import chain, combinations
from collections import Counter
import urllib
import requests
import timeit
import plotly.express as px
import plotly.graph_objects as go
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from plotly.subplots import make_subplots
tqdm.pandas()
data = pd.read_csv("scrabble.csv")
data.tail()
| | Unnamed: 0 | Game | P1 | P2 | Comp | Date | Dict |
|---|---|---|---|---|---|---|---|
| 39712 | 39712 | #character-encoding UTF-8\r\n#player1 Matthew_... | Matthew Tunnicliffe | vs. Joshua Sokol | Tournament game | NaN | CSW21 |
| 39713 | 39713 | #character-encoding UTF-8\n#description Create... | Cesar Del Solar | vs. Jeffrey Jacobson | Teaneck, NJ | (05/01/22) | Round 4 | NWL20 |
| 39714 | 39714 | #character-encoding UTF-8\n#description Create... | Judy Cole | vs. Cesar Del Solar | Teaneck, NJ | (05/01/22) | Round 5 | NWL20 |
| 39715 | 39715 | #character-encoding UTF-8\n#description Create... | wrathkahn vs. fritzlein | NaN | NaN | NaN | NWL18 |
| 39716 | 39716 | #character-encoding UTF-8\n#description Create... | fritzlein vs. wrathkahn | NaN | NaN | NaN | NWL18 |
data["Dict"].value_counts()
TWL06      13741
TWL15       6039
CSW15       4229
NWL18       3511
CSW12       2538
CSW19       2168
NWL20       1265
CSW21        820
CSW07        600
TWL98        321
unknown       20
THTWL85        4
Name: Dict, dtype: int64
All word lists have been obtained from the official NASPA Zyzzyva app (https://scrabbleplayers.org/w/Zyzzyva), versions 3.0.4, 3.2.4 and 3.3.0.
CSW21 = open("CSW21.txt", encoding="UTF-8").read().split("\n")
CSW19 = open("CSW19.txt", encoding="UTF-8").read().split("\n")
CSW15 = open("CSW15.txt", encoding="UTF-8").read().split("\n")
CSW12 = open("CSW12.txt", encoding="UTF-8").read().split("\n")
CSW07 = open("CSW07.txt", encoding="UTF-8").read().split("\n")
NWL20 = open("NWL2020.txt", encoding="UTF-8").read().split("\n")
NWL18 = open("NWL2018.txt", encoding="UTF-8").read().split("\n")
TWL15 = open("OTCWL2016.txt", encoding="UTF-8").read().split("\n")
TWL06 = open("OWL2.txt", encoding="UTF-8").read().split("\n")
TWL98 = open("OWL.txt", encoding="UTF-8").read().split("\n")
wordlists = {"CSW21": CSW21,
"CSW19": CSW19,
"CSW15": CSW15,
"CSW12": CSW12,
"CSW07": CSW07,
"NWL20": NWL20,
"NWL18": NWL18,
"TWL15": TWL15,
"TWL06": TWL06,
"TWL98": TWL98}
tile_bag = pd.read_html("https://en.wikipedia.org/wiki/Scrabble_letter_distributions")[3]
tile_bag
| | Unnamed: 0 | ×1 | ×2 | ×3 | ×4 | ×6 | ×8 | ×9 | ×12 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | NaN | [blank] | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 1 | NaN | NaN | NaN | L S U | N R T | O | A I | E |
| 2 | 2 | NaN | NaN | G | D | NaN | NaN | NaN | NaN |
| 3 | 3 | NaN | B C M P | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 4 | NaN | F H V W Y | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | 5 | K | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 6 | 8 | J X | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 7 | 10 | Q Z | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
df_unigram = pd.read_csv("unigram_freq.csv")
df_unigram
| | word | count |
|---|---|---|
| 0 | the | 23135851162 |
| 1 | of | 13151942776 |
| 2 | and | 12997637966 |
| 3 | to | 12136980858 |
| 4 | a | 9081174698 |
| ... | ... | ... |
| 333328 | gooek | 12711 |
| 333329 | gooddg | 12711 |
| 333330 | gooblle | 12711 |
| 333331 | gollgo | 12711 |
| 333332 | golgw | 12711 |
333333 rows × 2 columns
board_layout = pd.read_csv("board_layout.csv", header=None)
board_layout
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3W | NaN | NaN | 2L | NaN | NaN | NaN | 3W | NaN | NaN | NaN | 2L | NaN | NaN | 3W |
| 1 | NaN | 2W | NaN | NaN | NaN | 3L | NaN | NaN | NaN | 3L | NaN | NaN | NaN | 2W | NaN |
| 2 | NaN | NaN | 2W | NaN | NaN | NaN | 2L | NaN | 2L | NaN | NaN | NaN | 2W | NaN | NaN |
| 3 | 2L | NaN | NaN | 2W | NaN | NaN | NaN | 2L | NaN | NaN | NaN | 2W | NaN | NaN | 2L |
| 4 | NaN | NaN | NaN | NaN | 2W | NaN | NaN | NaN | NaN | NaN | 2W | NaN | NaN | NaN | NaN |
| 5 | NaN | 3L | NaN | NaN | NaN | 3L | NaN | NaN | NaN | 3L | NaN | NaN | NaN | 3L | NaN |
| 6 | NaN | NaN | 2L | NaN | NaN | NaN | 2L | NaN | 2L | NaN | NaN | NaN | 2L | NaN | NaN |
| 7 | 3W | NaN | NaN | 2L | NaN | NaN | NaN | 2W | NaN | NaN | NaN | 2L | NaN | NaN | 3W |
| 8 | NaN | NaN | 2L | NaN | NaN | NaN | 2L | NaN | 2L | NaN | NaN | NaN | 2L | NaN | NaN |
| 9 | NaN | 3L | NaN | NaN | NaN | 3L | NaN | NaN | NaN | 3L | NaN | NaN | NaN | 3L | NaN |
| 10 | NaN | NaN | NaN | NaN | 2W | NaN | NaN | NaN | NaN | NaN | 2W | NaN | NaN | NaN | NaN |
| 11 | 2L | NaN | NaN | 2W | NaN | NaN | NaN | 2L | NaN | NaN | NaN | 2W | NaN | NaN | 2L |
| 12 | NaN | NaN | 2W | NaN | NaN | NaN | 2L | NaN | 2L | NaN | NaN | NaN | 2W | NaN | NaN |
| 13 | NaN | 2W | NaN | NaN | NaN | 3L | NaN | NaN | NaN | 3L | NaN | NaN | NaN | 2W | NaN |
| 14 | 3W | NaN | NaN | 2L | NaN | NaN | NaN | 3W | NaN | NaN | NaN | 2L | NaN | NaN | 3W |
Since we only care about competitive play, we will filter for games played in competitions. (The data for non-competition games is unreliable anyway, because people submit fake games.)
data = data[data["Comp"].notna()].copy()
The columns of `data` were scraped directly, so we have to clean up the columns P2, Date and Dict.
# Remove "vs. "
data["P2"] = data["P2"].str[4:]
data.head()
| | Unnamed: 0 | Game | P1 | P2 | Comp | Date | Dict |
|---|---|---|---|---|---|---|---|
| 0 | 0 | #player1 John_O'Laughlin John O'Laughlin\r\n#p... | John O'Laughlin | Seth Lipkin | Stamford CT | (06/15/07) | Round 2 | TWL06 |
| 7 | 7 | #player1 Seth Seth\r\n#player2 Joel_Horn Joel ... | Seth Lipkin | Joel Horn | Stamford CT | (06/15/07) | Round 5 | TWL06 |
| 8 | 8 | #player1 Debbie_Stegman Debbie Stegman\r\n#pla... | Debbie Stegman | Seth Lipkin | Stamford CT | (06/15/07) | Round 3 | TWL06 |
| 9 | 9 | #player1 k8 k8\n#player2 dl dl\n>k8: EEEFGTY 8... | Kate Fukawa-Connelly | Dielle Pinto | Boston MA | (07/07/07) | Round 8 | TWL06 |
| 10 | 10 | #player1 Seth Seth\r\n#player2 Ben_Dweck Ben D... | Seth Lipkin | Ben Dweck | Stamford CT | (06/15/07) | Round 6 | TWL06 |
# Extract date between brackets and turn it into datetime object
data["Date"] = pd.to_datetime(data["Date"].str[1:9])
data
| | Unnamed: 0 | Game | P1 | P2 | Comp | Date | Dict |
|---|---|---|---|---|---|---|---|
| 0 | 0 | #player1 John_O'Laughlin John O'Laughlin\r\n#p... | John O'Laughlin | Seth Lipkin | Stamford CT | 2007-06-15 | TWL06 |
| 7 | 7 | #player1 Seth Seth\r\n#player2 Joel_Horn Joel ... | Seth Lipkin | Joel Horn | Stamford CT | 2007-06-15 | TWL06 |
| 8 | 8 | #player1 Debbie_Stegman Debbie Stegman\r\n#pla... | Debbie Stegman | Seth Lipkin | Stamford CT | 2007-06-15 | TWL06 |
| 9 | 9 | #player1 k8 k8\n#player2 dl dl\n>k8: EEEFGTY 8... | Kate Fukawa-Connelly | Dielle Pinto | Boston MA | 2007-07-07 | TWL06 |
| 10 | 10 | #player1 Seth Seth\r\n#player2 Ben_Dweck Ben D... | Seth Lipkin | Ben Dweck | Stamford CT | 2007-06-15 | TWL06 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 39710 | 39710 | #character-encoding UTF-8\r\n#player1 Jeremy_H... | Jeremy Hildebrand | Matthew Tunnicliffe | Ottawa, ON | 2022-09-17 | CSW21 |
| 39711 | 39711 | #character-encoding UTF-8\r\n#player1 Deen_Her... | Deen Hergott | Matthew Tunnicliffe | Ottawa, ON | 2022-09-17 | CSW21 |
| 39712 | 39712 | #character-encoding UTF-8\r\n#player1 Matthew_... | Matthew Tunnicliffe | Joshua Sokol | Tournament game | NaT | CSW21 |
| 39713 | 39713 | #character-encoding UTF-8\n#description Create... | Cesar Del Solar | Jeffrey Jacobson | Teaneck, NJ | 2022-05-01 | NWL20 |
| 39714 | 39714 | #character-encoding UTF-8\n#description Create... | Judy Cole | Cesar Del Solar | Teaneck, NJ | 2022-05-01 | NWL20 |
30224 rows × 7 columns
Some people upload games against Quackle (a bot), so we remove games with that player.
data_quackle = data[((data["P1"] == "Quackle") | (data["P2"] == "Quackle"))].copy()
data = data[~((data["P1"] == "Quackle") | (data["P2"] == "Quackle"))].copy()
Now we will remove the definitions bundled with some of the word lists.
for i in wordlists.items():
    print(i[0])
    print(i[1][0])
CSW21
AA
CSW19
AA
CSW15
AA
CSW12
AA a rough cindery lava [n -S]
CSW07
AA a rough cindery lava [n -S]
NWL20
AA rough, cindery lava [n AAS]
NWL18
AA rough, cindery lava [n AAS]
TWL15
AA rough, cindery lava [n AAS]
TWL06
AA rough, cindery lava [n -S]
TWL98
AA rough, cindery lava [n -S]
Every word list except CSW15, CSW19 and CSW21 includes definitions, so we will strip them now.
for key in ["CSW12", "CSW07", "NWL20", "NWL18", "TWL15", "TWL06", "TWL98"]:
    # keep only the word, dropping the definition after the first space
    wordlists[key] = [word.split(" ", 1)[0] for word in wordlists[key]]
for i in wordlists.items():
    print(i[0])
    print(i[1][0])
CSW21
AA
CSW19
AA
CSW15
AA
CSW12
AA
CSW07
AA
NWL20
AA
NWL18
AA
TWL15
AA
TWL06
AA
TWL98
AA
Now, we reassign the values to the variables.
CSW21 = wordlists["CSW21"]
CSW19 = wordlists["CSW19"]
CSW15 = wordlists["CSW15"]
CSW12 = wordlists["CSW12"]
CSW07 = wordlists["CSW07"]
NWL20 = wordlists["NWL20"]
NWL18 = wordlists["NWL18"]
TWL15 = wordlists["TWL15"]
TWL06 = wordlists["TWL06"]
TWL98 = wordlists["TWL98"]
b = data["Game"][0]
s = pd.Series(b.split("\n"))
print(s)
s[s.str.match(">.*:( .*){1}( [\d\w]*)( .*){3}")].str.split(" ", expand=True)
0          #player1 John_O'Laughlin John O'Laughlin\r
1                               #player2 Seth Seth\r
2               >John_O'Laughlin: IQ 8G QI +22 22\r
3                   >Seth: GNNOOOV -GOOONV +0 0\r
4                  #note VOGON* is still not good...\r
5              >John_O'Laughlin: IOP 9E POI +18 40\r
6                    >Seth: EENSSVX 10D VEX +58 58\r
7            >John_O'Laughlin: AJOS 11E SOJA +42 82\r
8                     >Seth: ENNSSTU I9 UNS +16 74\r
9          >John_O'Laughlin: AERW H11 .WARE +36 118\r
10               >Seth: EEINNST J4 INTENSE +67 141\r
11            >John_O'Laughlin: AY G11 .AY +31 149\r
12                 >Seth: AEHRRTT 12A HEART +30 171\r
13         >John_O'Laughlin: GRUU K10 GURU +14 163\r
14                  >Seth: AERTTTZ A12 .AZE +48 219\r
15      >John_O'Laughlin: DEGIMO 5E DEMOI.G +44 207\r
16                >Seth: BIRTTTU H1 TURB.T +33 252\r
17            >John_O'Laughlin: DFI 6D DIF +31 238\r
18                 >Seth: CIILPTW 15D TWIC. +11 263\r
19    >John_O'Laughlin: ADIILNO 1D DILA.ION +61 299\r
20                  >Seth: ACILPSY 2J PALSY +44 307\r
21           >John_O'Laughlin: ENO 14A .ONE +28 327\r
22                  >Seth: ?BCDIKO 13J B.CK +24 331\r
23    >John_O'Laughlin: ELLORRT M2 .TROLLER +70 397\r
24               >Seth: ??DEFIO 3A mODIFiE. +72 403\r
25        >John_O'Laughlin: AEEGHMO N10 MAHOE +45 442\r
26                         >Seth: AGN O9 NAG +11 414\r
27                         >Seth: (EV) +10 424\r
dtype: object
| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 2 | >John_O'Laughlin: | IQ | 8G | QI | +22 | 22\r |
| 5 | >John_O'Laughlin: | IOP | 9E | POI | +18 | 40\r |
| 6 | >Seth: | EENSSVX | 10D | VEX | +58 | 58\r |
| 7 | >John_O'Laughlin: | AJOS | 11E | SOJA | +42 | 82\r |
| 8 | >Seth: | ENNSSTU | I9 | UNS | +16 | 74\r |
| 9 | >John_O'Laughlin: | AERW | H11 | .WARE | +36 | 118\r |
| 10 | >Seth: | EEINNST | J4 | INTENSE | +67 | 141\r |
| 11 | >John_O'Laughlin: | AY | G11 | .AY | +31 | 149\r |
| 12 | >Seth: | AEHRRTT | 12A | HEART | +30 | 171\r |
| 13 | >John_O'Laughlin: | GRUU | K10 | GURU | +14 | 163\r |
| 14 | >Seth: | AERTTTZ | A12 | .AZE | +48 | 219\r |
| 15 | >John_O'Laughlin: | DEGIMO | 5E | DEMOI.G | +44 | 207\r |
| 16 | >Seth: | BIRTTTU | H1 | TURB.T | +33 | 252\r |
| 17 | >John_O'Laughlin: | DFI | 6D | DIF | +31 | 238\r |
| 18 | >Seth: | CIILPTW | 15D | TWIC. | +11 | 263\r |
| 19 | >John_O'Laughlin: | ADIILNO | 1D | DILA.ION | +61 | 299\r |
| 20 | >Seth: | ACILPSY | 2J | PALSY | +44 | 307\r |
| 21 | >John_O'Laughlin: | ENO | 14A | .ONE | +28 | 327\r |
| 22 | >Seth: | ?BCDIKO | 13J | B.CK | +24 | 331\r |
| 23 | >John_O'Laughlin: | ELLORRT | M2 | .TROLLER | +70 | 397\r |
| 24 | >Seth: | ??DEFIO | 3A | mODIFiE. | +72 | 403\r |
| 25 | >John_O'Laughlin: | AEEGHMO | N10 | MAHOE | +45 | 442\r |
| 26 | >Seth: | AGN | O9 | NAG | +11 | 414\r |
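The fields the regex pulls out of each GCG play line can be illustrated on one of the rows above; the simplified pattern below is my own illustrative sketch, not the exact expression used in the notebook.

```python
import re

# A GCG play line: >name: rack position play +score cumulative-score
line = ">Seth: EENSSVX 10D VEX +58 58"
pattern = re.compile(r">(.*): (\S+) (\w+) ([\w.]+) \+(\d+) (\d+)")
m = pattern.match(line)
print(m.groups())  # ('Seth', 'EENSSVX', '10D', 'VEX', '58', '58')
```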
We will now clean up the tile bag.
tile_bag
| | Unnamed: 0 | ×1 | ×2 | ×3 | ×4 | ×6 | ×8 | ×9 | ×12 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | NaN | [blank] | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 1 | NaN | NaN | NaN | L S U | N R T | O | A I | E |
| 2 | 2 | NaN | NaN | G | D | NaN | NaN | NaN | NaN |
| 3 | 3 | NaN | B C M P | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 4 | NaN | F H V W Y | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | 5 | K | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 6 | 8 | J X | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 7 | 10 | Q Z | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
tile_bag = tile_bag.set_index(tile_bag.columns[0])
tile_bag
| | ×1 | ×2 | ×3 | ×4 | ×6 | ×8 | ×9 | ×12 |
|---|---|---|---|---|---|---|---|---|
| Unnamed: 0 | | | | | | | | |
| 0 | NaN | [blank] | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | NaN | NaN | NaN | L S U | N R T | O | A I | E |
| 2 | NaN | NaN | G | D | NaN | NaN | NaN | NaN |
| 3 | NaN | B C M P | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | F H V W Y | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | K | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 8 | J X | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 10 | Q Z | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
index = []
count = []
value = []
for i in tile_bag.index:
    for c in tile_bag.columns:
        letters = tile_bag.loc[i, c]
        if pd.notnull(letters):
            for letter in letters.split(" "):
                index.append(letter)
                count.append(c)
                value.append(i)
tile_bag = pd.DataFrame(index=index, data={"Count": count, "Value": value}).sort_index()
tile_bag.head()
| | Count | Value |
|---|---|---|
| A | ×9 | 1 |
| B | ×2 | 3 |
| C | ×2 | 3 |
| D | ×4 | 2 |
| E | ×12 | 1 |
# strip the × prefix from Count and convert it to an integer
tile_bag["Count"] = pd.to_numeric(tile_bag["Count"].str[1:])
tile_bag.head()
| | Count | Value |
|---|---|---|
| A | 9 | 1 |
| B | 2 | 3 |
| C | 2 | 3 |
| D | 4 | 2 |
| E | 12 | 1 |
# remove games that could not be found
data = data[~data["Game"].str.contains("File not found")].copy()
data["Game"].str.contains("File not found").value_counts()
False    29994
Name: Game, dtype: int64
To check whether a game is valid, one of the things I will do is verify that the tiles used form a sub-multiset of the real bag.
counter_tilebag = Counter("".join(tile_bag["Count"] * tile_bag.index).replace("[blank]", " "))
re_lowercase = re.compile("[a-z]")
re_lowercase
re.compile(r'[a-z]', re.UNICODE)
def isinbag(plays):
    played = re_lowercase.sub(" ", "".join(plays).replace(".", ""))
    counter = Counter(played)
    return all(counter_tilebag[i] >= counter[i] for i in counter)
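The bag check can be exercised in isolation; below is a self-contained sketch of the same multiset-subset idea with a deliberately reduced bag (the counts and the `fits_in_bag` helper are illustrative, not the notebook's actual objects).

```python
from collections import Counter
import re

# Reduced illustrative bag: letter -> number of tiles (not the real distribution)
small_bag = Counter({"Q": 1, "I": 9, "V": 2, "E": 12, "X": 1, "S": 4})
re_lower = re.compile("[a-z]")

def fits_in_bag(plays, bag):
    # lowercase letters are blanks; '.' marks tiles already on the board
    played = re_lower.sub(" ", "".join(plays).replace(".", ""))
    counts = Counter(played)
    return all(bag[letter] >= n for letter, n in counts.items())

print(fits_in_bag(["QI", "VEX"], small_bag))   # True: the tiles fit in the bag
print(fits_in_bag(["QI", "QIS"], small_bag))   # False: would need two Q tiles
```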
# find faulty games:
# wrongly formatted, non-English words, or obviously fake games (plays scoring above 392)
faulty = []
def check_invalid_row(row):
    s = pd.Series(row["Game"].split("\n"))
    # drop the play preceding each successful challenge line
    s = s.drop(np.where(s.str.match(r">.*: +[\w?]+ +\-\- +\-[\d]+ +[\d]+"))[0] - 1)
    df = s[s.str.match(r">.*: [\w?]+ [\d\w]+ [\w.]+ \+[\d]+ [\d]+(\r)?")].str.split(" ", n=5, expand=True)
    if len(df.columns) != 6 or \
       False in df[3].str.match(r"[a-zA-Z.]*").values or \
       True in (pd.to_numeric(df[4].str[1:], errors="coerce") > 392).values or \
       not isinbag(df[3]):
        global faulty
        faulty.append(row.name)
data.progress_apply(check_invalid_row, axis=1);
0%| | 0/29994 [00:00<?, ?it/s]
len(faulty)
945
Since the number of faulty games, 945, is small compared to the roughly 30,000 remaining games, we will simply drop them.
data = data.drop(faulty)
df_unigram["word"] = df_unigram["word"].str.upper()
df_unigram
| | word | count |
|---|---|---|
| 0 | THE | 23135851162 |
| 1 | OF | 13151942776 |
| 2 | AND | 12997637966 |
| 3 | TO | 12136980858 |
| 4 | A | 9081174698 |
| ... | ... | ... |
| 333328 | GOOEK | 12711 |
| 333329 | GOODDG | 12711 |
| 333330 | GOOBLLE | 12711 |
| 333331 | GOLLGO | 12711 |
| 333332 | GOLGW | 12711 |
333333 rows × 2 columns
Replace the NaNs in board_layout with empty strings and convert it to an array.
board_layout = np.array(board_layout.fillna(""))
board_layout
array([['3W', '', '', '2L', '', '', '', '3W', '', '', '', '2L', '', '',
'3W'],
['', '2W', '', '', '', '3L', '', '', '', '3L', '', '', '', '2W',
''],
['', '', '2W', '', '', '', '2L', '', '2L', '', '', '', '2W', '',
''],
['2L', '', '', '2W', '', '', '', '2L', '', '', '', '2W', '', '',
'2L'],
['', '', '', '', '2W', '', '', '', '', '', '2W', '', '', '', ''],
['', '3L', '', '', '', '3L', '', '', '', '3L', '', '', '', '3L',
''],
['', '', '2L', '', '', '', '2L', '', '2L', '', '', '', '2L', '',
''],
['3W', '', '', '2L', '', '', '', '2W', '', '', '', '2L', '', '',
'3W'],
['', '', '2L', '', '', '', '2L', '', '2L', '', '', '', '2L', '',
''],
['', '3L', '', '', '', '3L', '', '', '', '3L', '', '', '', '3L',
''],
['', '', '', '', '2W', '', '', '', '', '', '2W', '', '', '', ''],
['2L', '', '', '2W', '', '', '', '2L', '', '', '', '2W', '', '',
'2L'],
['', '', '2W', '', '', '', '2L', '', '2L', '', '', '', '2W', '',
''],
['', '2W', '', '', '', '3L', '', '', '', '3L', '', '', '', '2W',
''],
['3W', '', '', '2L', '', '', '', '3W', '', '', '', '2L', '', '',
       '3W']], dtype=object)
According to some accounts, the creator of Scrabble, Alfred Butts, studied the front page of the New York Times to determine the tile distributions. Now, we will see whether that is consistent with the letter frequencies in the dictionaries.
Firstly, I suspect that the American and British word lists could have significantly different letter distributions. Hence, we will first verify this by comparing the latest British word list (CSW21) and American word list (NWL20).
letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
CSW21_string = "".join(CSW21)
NWL20_string = "".join(NWL20)
letter_freqs_CSW21 = [CSW21_string.count(letter)/len(CSW21_string) for letter in letters]
letter_freqs_NWL20 = [NWL20_string.count(letter)/len(NWL20_string) for letter in letters]
df_letter_freqs_CSW21 = pd.DataFrame({"Letter": list(letters), "Frequency": letter_freqs_CSW21, "Lexicon": "CSW21"})
df_letter_freqs_NWL20 = pd.DataFrame({"Letter": list(letters), "Frequency": letter_freqs_NWL20, "Lexicon": "NWL20"})
df_letter_freqs = pd.concat([df_letter_freqs_CSW21, df_letter_freqs_NWL20])
sns.barplot(x="Letter", y="Frequency", data=df_letter_freqs, hue="Lexicon")
plt.title("Letter distributions in CSW21, NWL20")
plt.show()
From there, we can conclude that the letter distributions in the two word lists are quite similar. Hence, we will just use the British word list CSW21, as it has more words, and compare it to the actual distribution of tile values and frequencies in the tile bag.
Since a more common letter should have a lower value, tile values should vary inversely with actual frequency, while frequency in the tile bag should be directly proportional to actual frequency.
tile_bag_relative = tile_bag / tile_bag.sum()
tile_bag_relative = tile_bag_relative.rename(columns={"Count": "Frequency in bag", "Value": "Relative value"})
tile_bag_relative.head()
| | Frequency in bag | Relative value |
|---|---|---|
| A | 0.09 | 0.011494 |
| B | 0.02 | 0.034483 |
| C | 0.02 | 0.034483 |
| D | 0.04 | 0.022989 |
| E | 0.12 | 0.011494 |
df_letters = df_letter_freqs_CSW21.join(tile_bag_relative, on="Letter")
df_letters = df_letters.rename(columns={"Frequency": "Frequency in CSW21"})
df_letters.head()
| | Letter | Frequency in CSW21 | Lexicon | Frequency in bag | Relative value |
|---|---|---|---|---|---|
| 0 | A | 0.077325 | CSW21 | 0.09 | 0.011494 |
| 1 | B | 0.018586 | CSW21 | 0.02 | 0.034483 |
| 2 | C | 0.040095 | CSW21 | 0.02 | 0.034483 |
| 3 | D | 0.033565 | CSW21 | 0.04 | 0.022989 |
| 4 | E | 0.112823 | CSW21 | 0.12 | 0.011494 |
plt.figure(figsize=(20,20))
sns.scatterplot(x="Frequency in CSW21", y="Frequency in bag", data=df_letters, s=100, color="lightgray")
def label_point(row):
    ax = plt.gca()
    ax.annotate(row["Letter"], (row["Frequency in CSW21"], row["Frequency in bag"]))
df_letters.apply(label_point, axis=1)
mi = min(df_letters["Frequency in CSW21"].min(), df_letters["Frequency in bag"].min())
ma = max(df_letters["Frequency in CSW21"].max(), df_letters["Frequency in bag"].max())
plt.plot([mi,ma],[mi,ma], color="red")
plt.title("Comparison of letter frequency in tile bag vs actual word list (CSW21)")
plt.show()
We can see that most of the points lie near the y=x line. This tells us that the frequencies of tiles in the bag, and in word list are relatively similar. We can see an outlier, the S tile, which is a lot less common in the bag than in the word list.
plt.figure(figsize=(20,20))
sns.scatterplot(x="Frequency in CSW21", y="Relative value", data=df_letters)
def label_point(row):
    ax = plt.gca()
    ax.annotate(row["Letter"], (row["Frequency in CSW21"], row["Relative value"]))
df_letters.apply(label_point, axis=1)
plt.title("Letter's relative value vs frequency in word list (CSW21)")
plt.show()
From this scatter plot we can tell that the relationship between frequency in the word list and tile value is likely not linear; it looks more like an inverse, decaying relationship.
df_letters["ln Relative value"] = np.log(df_letters["Relative value"])
df_letters.head()
| | Letter | Frequency in CSW21 | Lexicon | Frequency in bag | Relative value | ln Relative value |
|---|---|---|---|---|---|---|
| 0 | A | 0.077325 | CSW21 | 0.09 | 0.011494 | -4.465908 |
| 1 | B | 0.018586 | CSW21 | 0.02 | 0.034483 | -3.367296 |
| 2 | C | 0.040095 | CSW21 | 0.02 | 0.034483 | -3.367296 |
| 3 | D | 0.033565 | CSW21 | 0.04 | 0.022989 | -3.772761 |
| 4 | E | 0.112823 | CSW21 | 0.12 | 0.011494 | -4.465908 |
plt.figure(figsize=(20,20))
sns.regplot(x="Frequency in CSW21", y="ln Relative value", data=df_letters)
def label_point(row):
    ax = plt.gca()
    ax.annotate(row["Letter"], (row["Frequency in CSW21"], row["ln Relative value"]))
df_letters.apply(label_point, axis=1)
plt.title("Ln of letter's relative value vs frequency in word list (CSW21)")
plt.show()
stats.pearsonr(df_letters["Frequency in CSW21"], df_letters["ln Relative value"])
(-0.8716204401113973, 6.767143180660017e-09)
With a Pearson coefficient of -0.87, quite close to -1, we can conclude that a letter's frequency in the word list and the log of its tile value are strongly negatively correlated: tile value decays rapidly as word-list frequency rises, consistent with an inverse relationship. The tiny p-value means this correlation is very unlikely to arise by chance, even with only 26 data points.
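The fitted line from the regplot above can be reproduced numerically; here is a minimal sketch on made-up frequency/value pairs (the actual numbers live in `df_letters`), using a plain least-squares fit.

```python
import numpy as np

# Made-up letter frequencies and tile values, roughly mimicking the real
# pattern (common letters are cheap, rare letters are expensive):
freq = np.array([0.11, 0.08, 0.06, 0.04, 0.02, 0.01])
value = np.array([1, 1, 2, 4, 8, 10])

# Least-squares line of ln(value) against frequency, mirroring the regplot
slope, intercept = np.polyfit(freq, np.log(value), 1)
print(slope < 0)  # True: value falls as frequency rises
```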
What tiles are truly the best to get, and which are the worst? I often see competitive players dump their Q as fast as possible, despite it having a value of 10. This made me wonder whether Q was truly that bad.
To determine the tiles' true values, we will look at the distribution of points scored by plays using those tiles. For example, to find the true value of G, we look at every word played containing G and examine the distribution of its scores.
Since we are using the data on games now, we first need to define a function to convert the GCG format to a readable dataframe.
def get_valid_words(game):
    s = pd.Series(game.split("\n"))
    # drop challenges
    s = s.drop(np.where(s.str.match(r">.*: +[\w?]+ +\-\- +\-[\d]+ +[\d]+"))[0] - 1)
    # Plays (not comments) all start with >
    # Then, we match the rest of the string with the format of a valid play:
    # >[name]: [rack] [position] [play] [score] [cum score]
    # e.g.
    # >John: MEGADRY N8 GAME +13 45
    df = s[s.str.match(r">.*: +[\w?]+ +[\d\w]+ +[\w.]+ +\+[\d]+ +[\d]+")].str.split(" +", expand=True)
    if len(df.columns) == 6:
        df.columns = ["Name", "Rack", "Position", "Play", "Score", "Cum Score"]
        df["Position"] = df["Position"].str.upper()
        df["Score"] = pd.to_numeric(df["Score"].str[1:])
        # cum score is irrelevant for this entire project, so we will not bother converting it
        return df
    else:
        return None
We will now create a dataframe containing only the plays.
data_plays = pd.concat(data["Game"].progress_map(get_valid_words).to_list(), keys=data.index)
data_plays.head()
0%| | 0/29049 [00:00<?, ?it/s]
| | | Name | Rack | Position | Play | Score | Cum Score |
|---|---|---|---|---|---|---|---|
| 0 | 2 | >John_O'Laughlin: | IQ | 8G | QI | 22 | 22\r |
| 5 | >John_O'Laughlin: | IOP | 9E | POI | 18 | 40\r | |
| 6 | >Seth: | EENSSVX | 10D | VEX | 58 | 58\r | |
| 7 | >John_O'Laughlin: | AJOS | 11E | SOJA | 42 | 82\r | |
| 8 | >Seth: | ENNSSTU | I9 | UNS | 16 | 74\r |
Now we will split every play into its letters and record each letter with the play's score in a new dataframe.
split_letters = data_plays["Play"].str.replace(".","", regex=False)
df_letter_scores = pd.concat([split_letters.map(list), data_plays["Score"]], axis=1, keys=["Letter", "Score"]).explode("Letter")
df_letter_scores.head()
| | | Letter | Score |
|---|---|---|---|
| 0 | 2 | Q | 22 |
| 2 | I | 22 | |
| 5 | P | 18 | |
| 5 | O | 18 | |
| 5 | I | 18 |
Lowercase letters indicate that a blank was used for that letter.
lowercase = re.compile("[a-z]")
df_letter_scores["Letter"] = df_letter_scores["Letter"].replace(lowercase, "[blank]")
Now we will look at the distributions of points scored by words made with each letter.
plt.figure(figsize=(20,10))
order = df_letter_scores.groupby("Letter").median().sort_values(by="Score", ascending=False).index
sns.boxplot(x="Letter", y="Score", data=df_letter_scores, order=order)
plt.title("Distribution of points scored by words made with every letter")
plt.show()
From here we can conclude that the blank is the best tile by far, with S not far behind. However, notice that for many letters, especially the most common ones in the tile bag (R, E, T, N, etc.), the distribution is heavily right-skewed, so the median greatly understates them. The high-scoring outliers are also too numerous to dismiss, as the median effectively does. Hence, I believe the mean would be a more representative metric here.
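The median-versus-mean point can be seen on synthetic right-skewed data; a small sketch with toy numbers (not the actual play scores):

```python
import numpy as np

rng = np.random.default_rng(0)
# Exponentially distributed toy "scores": heavily right-skewed
scores = rng.exponential(scale=30, size=10_000)

# For right-skewed data the median sits below the mean,
# which is why the median understates letters with many big plays.
print(np.median(scores) < scores.mean())  # True
```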
plt.figure(figsize=(20,10))
order = df_letter_scores.groupby("Letter").mean().sort_values(by="Score", ascending=False).index
sns.barplot(x="Letter", y="Score", data=df_letter_scores, order=order)
plt.title("Mean points scored by words made with every letter")
plt.show()
Here, with the mean as the metric instead, blank and S still reign supreme. However, we notice major shifts: letters like L rank a lot higher now, and Q a lot lower. (From personal experience, there is absolutely no way Q is better than L.)
In actual games, the state of the board is a major factor in judging how good a rack is, but analyzing that here would be too complex. Focusing just on the rack, the next best thing is analyzing how tiles synergize with each other. For instance, while S and R are the 2nd and 3rd most highly ranked letters, a rack of SSSRRR is terrible, because few words contain that many R's or S's.
This is especially prominent with letters like I, which ranks pretty high alone here, yet holding two I's is known to be a killer for your chances of bingoing. Hence, we will now analyze the synergies between pairs of letters. We could take this further and analyze combinations of 3, 4, up to 7 letters, but that is too complex to analyze here.
# function to return all unordered pairs of letters from a rack
def get_pairs(s):
    return list(combinations(s, 2))
# we only use racks with 7 letters, as partial racks are underreported (a player cannot see the opponent's full rack)
split_pairs = data_plays["Rack"][data_plays["Rack"].str.len() == 7].str.upper()
df_pair_scores = pd.concat(
    [split_pairs.progress_map(get_pairs), data_plays["Score"]],
    axis=1,
    keys=["Pair", "Score"]).dropna().explode("Pair")
df_pair_scores
0%| | 0/443672 [00:00<?, ?it/s]
| | | Pair | Score |
|---|---|---|---|
| 0 | 6 | (E, E) | 58 |
| 6 | (E, N) | 58 | |
| 6 | (E, S) | 58 | |
| 6 | (E, S) | 58 | |
| 6 | (E, V) | 58 | |
| ... | ... | ... | ... |
| 39714 | 39 | (E, O) | 48 |
| 39 | (E, S) | 48 | |
| 39 | (L, O) | 48 | |
| 39 | (L, S) | 48 | |
| 39 | (O, S) | 48 |
9317112 rows × 2 columns
def sort_tuple(t):
    return tuple(sorted(t))
df_pair_scores["Pair"] = df_pair_scores["Pair"].progress_map(sort_tuple, na_action="ignore")
0%| | 0/9317112 [00:00<?, ?it/s]
# note: despite the variable name, this is the mean score per pair (see the discussion of mean vs median above)
pair_medians = df_pair_scores.groupby("Pair")["Score"].mean()
pair_medians
Pair
(?, ?) 63.522077
(?, A) 54.493102
(?, B) 48.802212
(?, C) 53.990541
(?, D) 53.974632
...
(W, Z) 36.335573
(X, Y) 37.499532
(X, Z) 39.845083
(Y, Y) 28.731544
(Y, Z) 39.307054
Name: Score, Length: 373, dtype: float64
tiles = list("?ABCDEFGHIJKLMNOPQRSTUVWXYZ")
df_pair_medians = pd.DataFrame(np.nan, index=tiles, columns=tiles)
for i in pair_medians.index:
    df_pair_medians.loc[i[1], i[0]] = pair_medians[i]
df_pair_medians = df_pair_medians.rename(index={"?":"[blank]"}, columns={"?":"[blank]"})
df_pair_medians.head()
| [blank] | A | B | C | D | E | F | G | H | I | ... | Q | R | S | T | U | V | W | X | Y | Z | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| [blank] | 63.522077 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| A | 54.493102 | 31.344536 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| B | 48.802212 | 36.775338 | 29.674556 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| C | 53.990541 | 39.560209 | 32.432245 | 31.518138 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| D | 53.974632 | 39.402953 | 35.083603 | 37.628344 | 33.103385 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 27 columns
plt.figure(figsize=(12,10))
sns.heatmap(data=df_pair_medians, cmap="magma")
plt.title("Tile pair synergies")
plt.show()
Firstly, we notice that the triangle has some holes along the diagonal. This is because the letters at those holes (J, K, Q, X, Z) appear only once in the tile bag and so are never paired with themselves.
Immediately, we notice that the blank tile has great synergies with every other tile, as does S. R and T also have decent synergies with other tiles, while A and E have moderate ones, along with X and Z. However, the showing of X and Z is likely due to their high raw tile values.
Q has the worst synergies with other tiles, surprisingly not even synergizing that well with U. V and W also seem to go badly with most other tiles.
The best synergy is, as expected, blank with blank, as this gives many opportunities to bingo. Blank with S is not far behind, due to the versatility of S. Other notable combos include ES, RS and TS.
To investigate this question, we will create multiple word metrics to compare against word frequency in competitive play. First, we need to obtain the word frequencies themselves. However, the recorded plays currently contain "."s representing tiles that were already on the board when the play was made.
The code below fills in these periods. Since it takes about 40 minutes to run, I have commented out the final step and will read from a saved CSV to save time.
alp = list("ABCDEFGHIJKLMNO")
num = list(np.arange(1,16).astype(str))
a = data_plays.copy()
a["Play"] = a["Play"].map(list)
re1 = re.compile(r"\d+[A-Z]")
re2 = re.compile(r"[A-Z]\d+")
coord = []
# map each play to the board coordinates of its letters, based on the starting position
def add_coords(row):
    pos = row["Position"]
    leng = len(row["Play"])
    if re.match(re1, pos):  # e.g. "8G": row 8, starting at column G, horizontal play
        ind = alp.index(pos[-1])
        coord.append(np.core.defchararray.add(pos[:-1], alp[ind:ind+leng]))
    elif re.match(re2, pos):  # e.g. "G8": column G, starting at row 8, vertical play
        ind = num.index(pos[1:])
        coord.append(np.core.defchararray.add(num[ind:ind+leng], pos[0]))
    else:
        print(row)
a.progress_apply(add_coords, axis=1)
a["coord"] = coord
a = a.explode(["Play", "coord"])
a
# def fixdot(row):
#     if row["Play"]==".":
#         c = a.loc[row.name[0]]
#         row["Play"] = c[(c["coord"]==row["coord"])]["Play"].iloc[0]
#     return row
# a[a["Play"]=="."] = a[a["Play"]=="."].progress_apply(fixdot, axis=1)
# #print(a.loc[0].to_string())
# a.index.set_names(["index1","index2"], inplace=True)
# data_plays["Play"] = a.groupby(["index1","index2"])["Play"].sum()
# data_plays.to_csv("data_plays.csv")
| | | Name | Rack | Position | Play | Score | Cum Score | coord |
|---|---|---|---|---|---|---|---|---|
| 0 | 2 | >John_O'Laughlin: | IQ | 8G | Q | 22 | 22\r | 8G |
| | 2 | >John_O'Laughlin: | IQ | 8G | I | 22 | 22\r | 8H |
| | 5 | >John_O'Laughlin: | IOP | 9E | P | 18 | 40\r | 9E |
| | 5 | >John_O'Laughlin: | IOP | 9E | O | 18 | 40\r | 9F |
| | 5 | >John_O'Laughlin: | IOP | 9E | I | 18 | 40\r | 9G |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 39714 | 42 | >cesar: | DDES | 2G | D | 9 | 401 | 2G |
| | 42 | >cesar: | DDES | 2G | . | 9 | 401 | 2H |
| | 42 | >cesar: | DDES | 2G | S | 9 | 401 | 2I |
| | 42 | >cesar: | DDES | 2G | E | 9 | 401 | 2J |
| | 42 | >cesar: | DDES | 2G | D | 9 | 401 | 2K |
3219514 rows × 7 columns
data_plays = pd.read_csv("data_plays.csv", dtype={"Cum Score": str})
data_plays = data_plays.set_index(["Unnamed: 0", "Unnamed: 1"])
data_plays
| Name | Rack | Position | Play | Score | Cum Score | ||
|---|---|---|---|---|---|---|---|
| Unnamed: 0 | Unnamed: 1 | ||||||
| 0 | 2 | >John_O'Laughlin: | IQ | 8G | QI | 22 | 22\r |
| 5 | >John_O'Laughlin: | IOP | 9E | POI | 18 | 40\r | |
| 6 | >Seth: | EENSSVX | 10D | VEX | 58 | 58\r | |
| 7 | >John_O'Laughlin: | AJOS | 11E | SOJA | 42 | 82\r | |
| 8 | >Seth: | ENNSSTU | I9 | UNS | 16 | 74\r | |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 39714 | 36 | >cesar: | ?DDILQS | 8A | cINQ | 66 | 344 |
| 38 | >judy: | BEST | 15A | BEST | 31 | 363 | |
| 39 | >cesar: | CDDELOS | H1 | COLZA | 48 | 392 | |
| 41 | >judy: | CEENRU | K9 | CURER | 14 | 377 | |
| 42 | >cesar: | DDES | 2G | DOSED | 9 | 401 |
561864 rows × 6 columns
word_freq = data_plays["Play"].str.upper().value_counts()
word_data = pd.DataFrame({"Frequency": word_freq})
word_data.index.name = "Word"
word_data = word_data.reset_index()
word_data = word_data[~word_data["Word"].str.contains(".", regex=False)].copy()  # drop words still containing unfilled "."s
word_data
| Word | Frequency | |
|---|---|---|
| 0 | QI | 7323 |
| 1 | QAT | 2767 |
| 2 | XI | 1550 |
| 3 | ZA | 1339 |
| 4 | OI | 1306 |
| ... | ... | ... |
| 64678 | WEEKLIES | 1 |
| 64679 | FLOSSY | 1 |
| 64680 | RASTER | 1 |
| 64681 | PEDUNCLE | 1 |
| 64682 | POWTERING | 1 |
64660 rows × 2 columns
Since these are the only words that appear in the games, we will calculate the other metrics for these words only.
# LENGTH
word_data["Length"] = word_data["Word"].str.len()
word_data
| Word | Frequency | Length | |
|---|---|---|---|
| 0 | QI | 7323 | 2 |
| 1 | QAT | 2767 | 3 |
| 2 | XI | 1550 | 2 |
| 3 | ZA | 1339 | 2 |
| 4 | OI | 1306 | 2 |
| ... | ... | ... | ... |
| 64678 | WEEKLIES | 1 | 8 |
| 64679 | FLOSSY | 1 | 6 |
| 64680 | RASTER | 1 | 6 |
| 64681 | PEDUNCLE | 1 | 8 |
| 64682 | POWTERING | 1 | 9 |
64660 rows × 3 columns
# PROBABILITY: product of each letter's relative count in the bag (a rough estimate, not a true draw probability)
# VALUE: sum of tile values
word_data["Letters"] = word_data["Word"].apply(list)
exploded = word_data.explode("Letters").reset_index()
relative_bag = tile_bag.copy()
relative_bag["Count"] = relative_bag["Count"] / relative_bag["Count"].sum()
exploded = pd.concat([exploded, relative_bag.loc[exploded["Letters"]].reset_index()], axis=1)
word_data = word_data.merge(exploded.groupby("Word").agg({"Count": "prod", "Value": "sum"}), left_on="Word", right_index=True)
word_data = word_data.rename(columns={"Count": "Probability"})
word_data
| Word | Frequency | Length | Letters | Probability | Value | |
|---|---|---|---|---|---|---|
| 0 | QI | 7323 | 2 | [Q, I] | 9.000000e-04 | 11 |
| 1 | QAT | 2767 | 3 | [Q, A, T] | 5.400000e-05 | 12 |
| 2 | XI | 1550 | 2 | [X, I] | 9.000000e-04 | 9 |
| 3 | ZA | 1339 | 2 | [Z, A] | 9.000000e-04 | 11 |
| 4 | OI | 1306 | 2 | [O, I] | 7.200000e-03 | 2 |
| ... | ... | ... | ... | ... | ... | ... |
| 64678 | WEEKLIES | 1 | 8 | [W, E, E, K, L, I, E, S] | 4.976640e-11 | 15 |
| 64679 | FLOSSY | 1 | 6 | [F, L, O, S, S, Y] | 2.048000e-09 | 12 |
| 64680 | RASTER | 1 | 6 | [R, A, S, T, E, R] | 9.331200e-08 | 6 |
| 64681 | PEDUNCLE | 1 | 8 | [P, E, D, U, N, C, L, E] | 2.211840e-11 | 13 |
| 64682 | POWTERING | 1 | 9 | [P, O, W, T, E, R, I, N, G] | 2.239488e-12 | 15 |
64660 rows × 6 columns
# unigram (word frequency in real life)
word_data = word_data.merge(df_unigram, left_on="Word", right_on="word")
word_data
| Word | Frequency | Length | Letters | Probability | Value | word | count | |
|---|---|---|---|---|---|---|---|---|
| 0 | QI | 7323 | 2 | [Q, I] | 9.000000e-04 | 11 | QI | 1784611 |
| 1 | QAT | 2767 | 3 | [Q, A, T] | 5.400000e-05 | 12 | QAT | 82345 |
| 2 | XI | 1550 | 2 | [X, I] | 9.000000e-04 | 9 | XI | 7589570 |
| 3 | ZA | 1339 | 2 | [Z, A] | 9.000000e-04 | 11 | ZA | 7573829 |
| 4 | OI | 1306 | 2 | [O, I] | 7.200000e-03 | 2 | OI | 3243734 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 35860 | STUBBY | 1 | 6 | [S, T, U, B, B, Y] | 7.680000e-10 | 13 | STUBBY | 287176 |
| 35861 | WEEKLIES | 1 | 8 | [W, E, E, K, L, I, E, S] | 4.976640e-11 | 15 | WEEKLIES | 250787 |
| 35862 | FLOSSY | 1 | 6 | [F, L, O, S, S, Y] | 2.048000e-09 | 12 | FLOSSY | 15751 |
| 35863 | RASTER | 1 | 6 | [R, A, S, T, E, R] | 9.331200e-08 | 6 | RASTER | 1436870 |
| 35864 | PEDUNCLE | 1 | 8 | [P, E, D, U, N, C, L, E] | 2.211840e-11 | 13 | PEDUNCLE | 69407 |
35865 rows × 8 columns
sns.pairplot(data=word_data, x_vars=["Length", "Probability", "Value", "count"], y_vars=["Frequency"])
plt.title("Scatter plots of various variables against word usage frequency")
plt.show()
Every variable except value seems to have some correlation with word frequency in plays. We will now investigate each one.
sns.scatterplot(x="count", y="Frequency", data=word_data)
plt.title("Scatter plot of word usage frequency vs frequency in natural language")
plt.xlabel("Frequency in natural language")
plt.show()
Notice that a few outliers on both axes zoom the graph out a lot, so we shall manually zoom in on the dense part by changing the axis ranges (and take the log to spread out the data).
plt.figure(figsize=(10,10))
sns.scatterplot(x=np.log(word_data["count"]), y="Frequency", data=word_data)
plt.ylim(0,2000)
plt.title("Scatter plot of word usage frequency vs ln of frequency in natural language")
plt.show()
From this graph, we can clearly see there is no correlation. Moving on to length: since length is highly discrete, we can use line plots instead.
plt.figure(figsize=(10,10))
sns.lineplot(x="Length", y="Frequency", data=word_data)
plt.title("Word usage frequency vs length of word")
plt.show()
From this line plot, we can see a decreasing trend starting from length 2. Since longer words have more possible letter combinations, each individual word becomes rarer and rarer (there are a lot more 3-letter words than 2-letter words), which we can verify now.
lengthcounts = word_data.groupby("Length").size()
sns.barplot(x=lengthcounts.index, y=lengthcounts)
plt.title("Number of words of each length")
plt.ylabel("Count")
plt.show()
Finally, we can investigate the effect of probability on frequency.
sns.scatterplot(x="Probability", y="Frequency", data=word_data)
plt.title("Scatter plot of word usage frequency vs probability estimate of drawing the word")
plt.show()
Since the x-axis squishes the graph a lot, we take the ln of the x-axis. Additionally, following our earlier insights on length, we use it as a hue.
plt.figure(figsize=(20,20))
sns.scatterplot(x=np.log(word_data["Probability"]), y="Frequency", data=word_data, hue="Length", palette="viridis")
sns.regplot(x=np.log(word_data["Probability"]), y="Frequency", data=word_data, scatter=False)
plt.ylim(top=1000)
plt.title("Word usage frequency vs ln of probability estimate of drawing the word, with colour based on length")
plt.xlabel("ln(Probability)")
plt.show()
From here, we can infer that there is definitely some positive correlation between ln(probability) and frequency, which truly takes off at a probability of around e^(-20) ≈ 2 × 10^(-9). From the hue, we also notice that as probability and frequency increase, word length tends to decrease.
However, notice that despite how high the frequency values get, the linear regression line is dragged down by the near-horizontal line of points just above 0. That line is formed by the many 6-8 letter words that were unlikely to be played more than once in this dataset. Hence, we shall see what happens if we exclude very low frequencies (<= 20).
plt.figure(figsize=(20,20))
word_data2 = word_data[word_data["Frequency"] > 20]
sns.scatterplot(x=np.log(word_data2["Probability"]), y="Frequency", data=word_data2, hue="Length", palette="viridis")
sns.regplot(x=np.log(word_data2["Probability"]), y="Frequency", data=word_data2, scatter=False)
plt.title("Word usage frequency(min. 20) vs ln of probability estimate of drawing the word, with colour based on length")
plt.ylim(top=1000)
plt.xlabel("ln(Probability)")
plt.show()
Without the very small values, the linear regression line has a higher gradient and a trend is more visible among the data points. Even so, while the "upper bound" of the data follows a very clear upward trend, a regression over the entire dataset does not yield as clear a trend.
This is because the dataset is not large enough for these obscure 6-8 letter words to be played numerous times, so there will always be a sea of data points below the "upper bound" from words that just happened to be played once. With enough data, we should expect the "lower bound" to lift at some point after the "upper bound" does.
Another issue is that my "probability" statistic is flawed. It is not a true probability: computing the true probability of drawing a set of tiles requires some combinatorics. My statistic is only an estimate, and it especially falls apart when a word has many duplicate letters, overrating them considerably (multiplying relative frequencies assumes drawing with replacement, while tiles are actually drawn without replacement). This also pushes such data points too far to the right, contributing to the unclear trend here.
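To make this bias concrete, the exact probability that a draw of len(word) tiles is precisely the word's letters can be computed with binomial coefficients. The sketch below is my own simplification (it ignores blanks, 7-tile racks and tiles already played); `naive_prob` mirrors the estimate used above, and the miniature `bag` holds the real counts for just those letters out of the 100-tile bag:

```python
from collections import Counter
from math import comb, prod

# real Scrabble tile counts for a few letters (the full bag has 100 tiles)
bag = {"Q": 1, "I": 9, "E": 12}
BAG_SIZE = 100

def naive_prob(word):
    # the estimate used above: product of relative frequencies (drawing WITH replacement)
    return prod(bag[c] / BAG_SIZE for c in word)

def exact_prob(word):
    # ways to choose the needed copies of each letter, over all ways
    # to choose len(word) tiles from the bag (drawing WITHOUT replacement)
    need = Counter(word)
    ways = prod(comb(bag[c], k) for c, k in need.items())
    return ways / comb(BAG_SIZE, len(word))

# duplicates: the naive estimate overrates EE ...
print(naive_prob("EE"), exact_prob("EE"))  # 0.0144 vs ~0.0133
# ... while distinct-letter words like QI are actually underrated
print(naive_prob("QI"), exact_prob("QI"))  # 0.0009 vs ~0.0018
```

So relative to the exact figures, the naive product systematically inflates duplicate-heavy words, exactly as described above.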
However, assuming a linear relationship between log(probability) and frequency (aka a logarithmic relationship between probability and frequency), we obtain the pearson coefficient and p-value:
stats.pearsonr(np.log(word_data["Probability"]), word_data["Frequency"])
(0.2900616449988278, 0.0)
The pearson coefficient at least indicates a weak positive relationship between the two, with a low p-value (due to the sheer amount of data points). Hence, we can confidently conclude at least a weak positive relationship between probability of obtaining a word, and frequency of it being played.
Finally, we attempt to use all the variables so far to construct a multiple linear regression model for the word frequency. We will first try it with all the values (without removing values below 20)
mlm = LinearRegression()
X_train, X_test, y_train, y_test = train_test_split(word_data[["Length", "Probability", "Value", "count"]],
word_data["Frequency"],
test_size=0.2,
random_state=0)
mlm.fit(X_train, y_train)
LinearRegression()
The equation for this MLRM is given by:
eq = f"Predicted frequency = {mlm.coef_[0]} * Length + {mlm.coef_[1]} * Probability + {mlm.coef_[2]} * Value + {mlm.coef_[3]} * count + {mlm.intercept_}"
eq
'Predicted frequency = -12.015952458932638 * Length + 226.35067586307215 * Probability + 0.8716140023561847 * Value + 1.6441049410786945e-08 * count + 77.05958650292018'
Now, we can use the MLRM to predict some values and see how well it matches with the actual values.
yhat = mlm.predict(X_test)
plt.figure(figsize=(16,5))
sns.kdeplot(y_test, color='r', label='Actual Value')
sns.kdeplot(yhat, color='b', label='Fitted Value')
plt.legend()
plt.title("Actual and fitted values of word usage frequency from MLRM based on length, probability, value and count")
plt.show()
We will now zoom in on the spikes.
plt.figure(figsize=(16,5))
sns.kdeplot(y_test, color='r', label='Actual Value')
sns.kdeplot(yhat, color='b', label='Fitted Value')
plt.xlim(right=200)
plt.title("Actual and fitted values of word usage frequency from MLRM based on length, probability, value and count")
plt.legend()
plt.show()
residuals = y_test - yhat
sns.scatterplot(x=yhat, y=residuals)
plt.plot([yhat.min(),yhat.max()], [0,0], color="r")
plt.title("Residuals of MLRM")
plt.xlabel("Predicted values")
plt.show()
From the comparison, we can see that the MLRM is definitely not suitable for fitting frequency from the other 4 variables. Not only does the predicted frequency peak much lower, it also has multiple peaks, going erratic in the range 0-50.
From the residual plot, we can also see that the residuals fan out widely, to about 400, meaning the MLRM is less accurate at higher predicted values (at the curve).
Now, we shall try it with frequency values above 20 only.
mlm2 = LinearRegression()
X_train2, X_test2, y_train2, y_test2 = train_test_split(word_data2[["Length", "Probability", "Value", "count"]],
word_data2["Frequency"],
test_size=0.2,
random_state=0)
mlm2.fit(X_train2, y_train2)
LinearRegression()
The equation for this MLRM is given by:
eq2 = f"Predicted frequency = {mlm2.coef_[0]} * Length + {mlm2.coef_[1]} * Probability + {mlm2.coef_[2]} * Value + {mlm2.coef_[3]} * count + {mlm2.intercept_}"
eq2
'Predicted frequency = -54.28146881385214 * Length + 39361.04847412522 * Probability + 6.5757222873632495 * Value + 6.126356311142445e-09 * count + 234.76888355903765'
Now, we can use this MLRM to predict some values and see how well it matches with the actual values.
yhat2 = mlm2.predict(X_test2)
plt.figure(figsize=(16,5))
sns.kdeplot(y_test2, color='r', label='Actual Value')
sns.kdeplot(yhat2, color='b', label='Fitted Value')
plt.xlim(right=500)
plt.title("Actual and fitted values of word usage frequency(min. 20) from MLRM based on length, probability, value and count")
plt.legend()
plt.show()
residuals2 = y_test2 - yhat2
sns.scatterplot(x=yhat2, y=residuals2)
plt.plot([yhat2.min(),yhat2.max()], [0,0], color="r")
plt.title("Residuals of MLRM")
plt.ylim(top=1000)
plt.xlabel("Predicted values")
plt.show()
From the KDE comparison, we can see that this MLRM is much better at predicting the values, with a peak far closer to the actual peak. However, this peak is still somewhat off to the right and too high, overestimating many of the values around 50-150.
From the residual plot, we can see that the residuals still fan out, meaning the MLRM is less accurate at higher predicted values (at the curve); however, they now only reach about 200 rather than 400.
Hence, without the frequency values below 20, the MLRM fits the data much better.
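The KDE and residual plots compare the two models visually; a single number for the same comparison would be the coefficient of determination R² on the held-out set (scikit-learn exposes it as `mlm.score(X_test, y_test)`). As a reminder of what that score measures, here is a minimal pure-Python sketch (the sample values are made up for illustration):

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot,
    the fraction of variance in y_true explained by the predictions."""
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1 - ss_res / ss_tot

# perfect predictions give 1.0; always predicting the mean gives 0.0
print(r_squared([1, 2, 3, 4], [1, 2, 3, 4]))          # 1.0
print(r_squared([1, 2, 3, 4], [2.5, 2.5, 2.5, 2.5]))  # 0.0
```

A higher R² for the second model would quantify the improvement the plots suggest.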
Here, "pattern" refers to the board patterns of the words played. First, we will look at the densities of all tiles played.
#a["coord1"] = a["coord"].str.extract("([A-Z])")
#a["coord2"] = a["coord"].str.extract("(\d+)")
a
| | | Name | Rack | Position | Play | Score | Cum Score | coord |
|---|---|---|---|---|---|---|---|---|
| 0 | 2 | >John_O'Laughlin: | IQ | 8G | Q | 22 | 22\r | 8G |
| | 2 | >John_O'Laughlin: | IQ | 8G | I | 22 | 22\r | 8H |
| | 5 | >John_O'Laughlin: | IOP | 9E | P | 18 | 40\r | 9E |
| | 5 | >John_O'Laughlin: | IOP | 9E | O | 18 | 40\r | 9F |
| | 5 | >John_O'Laughlin: | IOP | 9E | I | 18 | 40\r | 9G |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 39714 | 42 | >cesar: | DDES | 2G | D | 9 | 401 | 2G |
| | 42 | >cesar: | DDES | 2G | . | 9 | 401 | 2H |
| | 42 | >cesar: | DDES | 2G | S | 9 | 401 | 2I |
| | 42 | >cesar: | DDES | 2G | E | 9 | 401 | 2J |
| | 42 | >cesar: | DDES | 2G | D | 9 | 401 | 2K |
3219514 rows × 7 columns
alp = list("ABCDEFGHIJKLMNO")
num = list(np.arange(1,16).astype(str))
board = pd.DataFrame(columns=alp, index=num)
coord_counts = pd.DataFrame(a["coord"].value_counts())
coord_counts["coord1"] = coord_counts.index.str.extract("([A-Z])").values
coord_counts["coord2"] = coord_counts.index.str.extract(r"(\d+)").values
# write each coordinate's tile count into the board grid
def add_coord(row):
    board.loc[row["coord2"], row["coord1"]] = row["coord"]
coord_counts.apply(add_coord, axis=1)
board = board.astype(int)
board
| A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 10061 | 8050 | 8055 | 10374 | 7330 | 8155 | 9573 | 21126 | 9831 | 8937 | 8291 | 13644 | 12002 | 12807 | 17085 |
| 2 | 7853 | 10957 | 10605 | 11630 | 8740 | 12212 | 9272 | 17722 | 9265 | 13924 | 10987 | 14308 | 13582 | 15306 | 12791 |
| 3 | 7843 | 10524 | 15750 | 15189 | 11691 | 11241 | 10916 | 17551 | 12409 | 13522 | 13662 | 17650 | 18994 | 13245 | 12372 |
| 4 | 10125 | 11419 | 14899 | 20159 | 15213 | 13104 | 10091 | 20993 | 12319 | 15442 | 17561 | 23134 | 16863 | 12987 | 13649 |
| 5 | 7269 | 8917 | 11755 | 15504 | 20231 | 14665 | 11728 | 17528 | 12989 | 16369 | 22632 | 17641 | 13293 | 10889 | 10595 |
| 6 | 8244 | 12295 | 11366 | 13554 | 15835 | 17140 | 11637 | 14957 | 12826 | 19304 | 17868 | 14298 | 11899 | 14037 | 11766 |
| 7 | 9766 | 9308 | 11096 | 10498 | 13636 | 12624 | 13036 | 18419 | 15442 | 16095 | 15225 | 11572 | 12718 | 12030 | 13187 |
| 8 | 18630 | 15116 | 14246 | 24781 | 22494 | 20834 | 28407 | 37011 | 24321 | 19880 | 17982 | 21525 | 19286 | 20037 | 25175 |
| 9 | 9772 | 9426 | 13270 | 14427 | 15506 | 14527 | 15465 | 17623 | 13852 | 13290 | 13624 | 11795 | 12756 | 11407 | 12649 |
| 10 | 8822 | 14658 | 13884 | 17134 | 17781 | 18724 | 13877 | 15810 | 12375 | 17714 | 16870 | 14861 | 12804 | 14573 | 11378 |
| 11 | 8179 | 11373 | 14087 | 18623 | 22492 | 16241 | 12906 | 17248 | 12550 | 16045 | 21830 | 17141 | 13354 | 11718 | 9922 |
| 12 | 13444 | 14678 | 18026 | 23329 | 16786 | 12954 | 10520 | 20986 | 11242 | 14011 | 16886 | 21493 | 16028 | 13069 | 12641 |
| 13 | 11771 | 13697 | 19216 | 16610 | 12791 | 11326 | 12614 | 21396 | 12927 | 12787 | 13285 | 16015 | 17591 | 12538 | 11046 |
| 14 | 12717 | 15585 | 13494 | 12852 | 10674 | 13995 | 11839 | 22048 | 11311 | 14383 | 11969 | 13115 | 12704 | 14310 | 11139 |
| 15 | 17245 | 13062 | 12676 | 13917 | 10848 | 11884 | 13239 | 26732 | 12349 | 11018 | 10089 | 12619 | 11237 | 11276 | 14722 |
fig = px.imshow(board, color_continuous_scale="magma", title="Heat map of tile frequency across board")
fig.update_layout(width=700, height=700)
fig.update_traces(text=board_layout, texttemplate="%{text}")
fig.update_layout(updatemenus=[
    dict(
        type="buttons",
        direction="left",
        buttons=[dict(label="Show mults", method="update", args=[{"text": [board_layout]}]),
                 dict(label="Hide mults", method="update", args=[{"text": [np.full((15,15), "")]}])]
    )]
)
fig.show()
The square with the highest frequency of letters is the central square, which is because every game must start with a word through that square.
Looking at the edge TWs (A8,O8,H1,H15), it seems that the East and South ones are more favoured over the other two.
Looking at the corner TWs, it seems that the North-West TW (A1) is used a lot less than the other 3 TWs.
Generally, tiles seem to cluster around the multipliers (the DLs and TWs down the middle row and column, and the DWs along the diagonals).
We will now look at other tiles of interest, starting with the blank.
blanks = a[a["Play"].str.islower()]
alp = list("ABCDEFGHIJKLMNO")
num = list(np.arange(1,16).astype(str))
board = pd.DataFrame(columns=alp, index=num)
coord_counts = pd.DataFrame(blanks["coord"].value_counts())
coord_counts["coord1"] = coord_counts.index.str.extract("([A-Z])").values
coord_counts["coord2"] = coord_counts.index.str.extract(r"(\d+)").values
def add_coord(row):
    board.loc[row["coord2"], row["coord1"]] = row["coord"]
coord_counts.apply(add_coord, axis=1)
board = board.astype(int)
fig = px.imshow(board, color_continuous_scale="magma", title="Heat map of blank tile's frequency across board")
fig.update_layout(width=700, height=700)
fig.update_traces(text=board_layout, texttemplate="%{text}")
fig.update_layout(updatemenus=[
    dict(
        type="buttons",
        direction="left",
        buttons=[dict(label="Show mults", method="update", args=[{"text": [board_layout]}]),
                 dict(label="Hide mults", method="update", args=[{"text": [np.full((15,15), "")]}])]
    )]
)
fig.show()
Here, we see quite the opposite: the blank rarely lands on the letter multipliers (the DLs and TLs).
This is because the blank is worth 0 points, so placing it on a DL or TL would waste the multiplier entirely.
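To make the waste concrete, here is a minimal sketch of how a play is scored (a simplification written for illustration; `play_score` and its inputs are not from the notebook). Each covered square contributes its tile value times any letter multiplier, and word multipliers then apply to the whole sum:

```python
def play_score(tiles):
    """Score a single play.

    tiles: list of (tile_value, letter_mult, word_mult) for each square covered.
    """
    total = 0
    word_mult = 1
    for value, letter_mult, wm in tiles:
        total += value * letter_mult  # letter multipliers apply per tile
        word_mult *= wm               # word multipliers stack over the play
    return total * word_mult

# "AS" with the second tile on a TL: a real S (1 point) earns the 3x bonus...
with_s = play_score([(1, 1, 1), (1, 3, 1)])      # 1 + 1*3 = 4
# ...but a blank standing in for S (0 points) wastes the TL entirely
with_blank = play_score([(1, 1, 1), (0, 3, 1)])  # 1 + 0*3 = 1
```

The blank's 0 value zeroes out any letter multiplier it sits on, which is exactly why it avoids the DLs and TLs in the heat map.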
Qs = a[a["Play"] == "Q"]
Zs = a[a["Play"] == "Z"]
Js = a[a["Play"] == "J"]
Xs = a[a["Play"] == "X"]
boards = []
alp = list("ABCDEFGHIJKLMNO")
num = list(np.arange(1,16).astype(str))
for i in tqdm([Qs, Zs, Js, Xs]):
    board = pd.DataFrame(columns=alp, index=num)
    coord_counts = pd.DataFrame(i["coord"].value_counts())
    coord_counts["coord1"] = coord_counts.index.str.extract("([A-Z])").values
    coord_counts["coord2"] = coord_counts.index.str.extract(r"(\d+)").values
    def add_coord(row):
        board.loc[row["coord2"], row["coord1"]] = row["coord"]
    coord_counts.apply(add_coord, axis=1)
    board = board.astype(int)
    boards.append(board)
subplots = make_subplots(rows=2, cols=2, subplot_titles=["Q","Z","J","X"])
for i in range(4):
    row = i//2 + 1
    col = i%2 + 1
    fig = px.imshow(boards[i], color_continuous_scale="magma")
    fig.update_layout(width=700, height=700)
    fig.update_traces(text=board_layout, texttemplate="%{text}")
    for trace in fig["data"]:
        subplots.append_trace(trace, row=row, col=col)
subplots.update_layout(width=900, height=1000,
    title="Heat map of QZJX tiles' frequencies across board",
    title_y=0.99,
    updatemenus=[
        dict(
            type="buttons",
            direction="left",
            buttons=[dict(label="Show mults", method="update", args=[{"text": [board_layout]}]),
                     dict(label="Hide mults", method="update", args=[{"text": [np.full((15,15), "")]}])],
            pad={"r": 10, "t": 20},
            showactive=True,
            x=0.11,
            xanchor="left",
            y=1.1,
            yanchor="top"
        )
    ]
)
subplots.show()
All 4 of these tiles are clearly denser along the multipliers, even showing some extra density along the DW diagonals.
They are particularly dense at the TLs, as players usually want to place these high-value tiles on letter multipliers for the large bonus.
plt.figure(figsize=(16,5))
plt.subplot(1, 2, 1)
sns.scatterplot(x="Frequency in CSW21", y="Frequency in bag", data=df_letters, s=100, color="lightgray")
def label_point(row):
    ax = plt.gca()
    ax.annotate(row["Letter"], (row["Frequency in CSW21"], row["Frequency in bag"]))
df_letters.apply(label_point, axis=1)
mi = min(df_letters["Frequency in CSW21"].min(), df_letters["Frequency in bag"].min())
ma = max(df_letters["Frequency in CSW21"].max(), df_letters["Frequency in bag"].max())
plt.plot([mi,ma], [mi,ma], color="red")
plt.title("Comparison of letter frequency in tile bag vs actual word list (CSW21)")
plt.subplot(1, 2, 2)
sns.regplot(x="Frequency in CSW21", y="ln Relative value", data=df_letters)
def label_point(row):
    ax = plt.gca()
    ax.annotate(row["Letter"], (row["Frequency in CSW21"], row["ln Relative value"]))
coef, pval = stats.pearsonr(df_letters["Frequency in CSW21"], df_letters["ln Relative value"])
plt.annotate(f"Pearson coefficient = {coef}", (0.01, -5.0))
plt.annotate(f"p-value = {pval}", (0.01, -5.5))
df_letters.apply(label_point, axis=1)
plt.title("Ln of letter's relative value vs frequency in word list (CSW21)")
plt.show()
The frequencies of tiles in the bag are very close to the frequencies of those letters in the word list CSW21, as seen by the points lying near the red y=x line in the figure on the left. S is an outlier, lying far below the line: it was deliberately given a lower count because it is so versatile (hooking onto the back of most nouns and verbs).
The values of tiles in the bag, taken logarithmically, form a strong linear correlation with letter frequency in CSW21, with a Pearson coefficient of about -0.87, very close to -1 (though the p-value may not be significant, as we have few data points). Hence the tile values are closely exponentially related to letter frequency: roughly value ∝ e^(-k × frequency) for some constant k > 0.
With both the frequencies and values of tiles in the bag highly correlated with the frequencies of those letters in the word list, we conclude that the tile bag is still very well calibrated to this day.
plt.figure(figsize=(20,10))
order = df_letter_scores.groupby("Letter").mean().sort_values(by="Score", ascending=False).index
sns.barplot(x="Letter", y="Score", data=df_letter_scores, order=order)
plt.title("Mean points scored by words made with every letter")
plt.show()
According to the metric of average points scored by plays with each letter, across all games in the dataset, the blank is by far the best tile, with S not too far behind.
R, T, E, Z, N and X are the next 6 best tiles. R, T, E and N rank highly because of how commonly they appear in words, making them versatile for forming any word, and especially bingos. Z and X rank highly because of their high tile values (10 and 8 respectively), coupled with their good compatibility with other letters (ZA, ZO; AX, EX, OX, XI), and because they appear in normal words more often than other rare letters like Q and J. S is ahead of them all because, in addition to being common in words, it is extremely versatile in hooking onto the back of almost any verb or noun (e.g. EAT(s), BALL(s)).
U, W and Y are the 3 worst letters. V is also rated poorly, partially because it is the only letter that forms no 2-letter words, and 2-letter words are the backbone of most Scrabble plays. While U is a vowel, it is the worst one, appearing far less commonly in words than the other vowels.
However, there is a major problem with this metric: it does not account for tiles that cause the player to miss out on points by unproductively taking up space on the rack. For example, Q's sheer incompatibility with other tiles makes it a tile players want to get rid of as fast as possible, as holding it practically blocks their chance of bingo-ing.
plt.figure(figsize=(12,10))
sns.heatmap(data=df_pair_medians, cmap="magma")
plt.title("Tile pair synergies")
plt.show()
Synergies between pairs of tiles were also measured, by considering the points scored from racks containing certain pairs of letters. This not only illustrates which pairs of tiles synergize particularly well, but also gives context for the previous graph.
Along the diagonal, the cells are generally darker than their surroundings, because letters tend to synergize poorly with themselves: short English words don't often contain duplicate letters, especially consonants. J, K, Q, X and Z are missing from this diagonal as they appear only once in the tile bag and hence are never paired with themselves.
The very light column on the left shows how the blank synergizes well with almost every other letter. This is expected, as the blank can become any letter.
R, S and T also form a clear stripe across the map, with S the lightest. S, as expected, synergizes well with every letter thanks to its versatility in hooking onto words. R and T have good synergies simply because they are strong tiles overall (see the previous graph), and the same goes for X and Z.
Q forms the darkest stripe across the heat map, simply because of how few words contain it. Notice that Q's best synergy is with U, as Q is almost always found with U (e.g. QUICK, QUAD, PIQUE, ...).
V and W also form quite dark stripes, which gives further context as to why they scored so poorly on the previous graph: they form no good synergies.
The best synergies are blank with blank, and blank with S, for the reasons already stated. Other notably good synergies (excluding the blank) are SE, SR and ST, likely because E, R and T are the 3 most highly ranked single tiles after the blank and S.
The worst synergies lie along the diagonal, specifically (U, U) and (V, V). Unlike "EE" or "OO", "UU" is a very rare digraph, and short words containing two U's are few and far between. (V, V) is even worse, which becomes obvious once you try to think of a word containing two V's (there are very few).
# Scatter plots of word usage frequency against ln(probability), with a fitted
# regression line and the Pearson statistics annotated on each panel.
fig, ax = plt.subplots(nrows=2, ncols=1, figsize=(15, 15))
plt.setp(ax, xlim=(-20, 0), ylim=(-500, 2000))
sns.scatterplot(x=np.log(word_data["Probability"]), y="Frequency", data=word_data, hue="Length", palette="viridis", ax=ax[0])
sns.regplot(x=np.log(word_data["Probability"]), y="Frequency", data=word_data, scatter=False, ax=ax[0])
ax[0].set_title("Word usage frequency vs ln of probability, with colour based on length")
ax[0].set_xlabel("ln(Probability)")
coef, pval = stats.pearsonr(np.log(word_data["Probability"]), word_data["Frequency"])
ax[0].annotate(f"Pearson coefficient = {coef:.3f}", (-17.5, 1500))
ax[0].annotate(f"p-value = {pval:.3g}", (-17.5, 1000))
# Same plots again, restricted to words played more than 20 times.
sns.scatterplot(x=np.log(word_data2["Probability"]), y="Frequency", data=word_data2, hue="Length", palette="viridis", ax=ax[1])
sns.regplot(x=np.log(word_data2["Probability"]), y="Frequency", data=word_data2, scatter=False, ax=ax[1])
ax[1].set_title("Word usage frequency(min. 20) vs ln of probability, with colour based on length")
ax[1].set_xlabel("ln(Probability)")
coef, pval = stats.pearsonr(np.log(word_data2["Probability"]), word_data2["Frequency"])
ax[1].annotate(f"Pearson coefficient = {coef:.3f}", (-17.5, 1500))
ax[1].annotate(f"p-value = {pval:.3g}", (-17.5, 1000))
plt.show()
The above plots show the frequency of words played in games in the dataset against the natural log of the probability of drawing those words. Probability here was calculated by taking the relative bag frequency of each tile in the word and multiplying them together. The points were plotted twice: once with all words, and once excluding words with frequency <= 20.
In both cases, we see a weak positive correlation, with Pearson coefficients near 0.3 and very low p-values. However, in the plot with all the words there is a very dense horizontal band at the bottom of the points, which pulls the fitted line towards it and flattens its gradient. This motivates replotting with the low-frequency words removed.
After removal, the Pearson coefficient noticeably increases from ~0.29 to ~0.31, with the p-value still very small. The words at the lowest frequencies are not well represented in this dataset: many just happen to appear a handful of times, creating a solid band at the base. In a larger dataset where every word is represented more fairly, these words would spread apart rather than all clustering at the same low counts.
Regardless, we can conclude a weak positive correlation between the log of a word's probability and its frequency in play.
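As a concrete sketch of the probability measure described above, using the standard 100-tile English Scrabble distribution (the function name is mine, and blanks are ignored for simplicity):

```python
# Standard English Scrabble tile distribution (98 lettered tiles + 2 blanks).
TILE_COUNTS = {
    "A": 9, "B": 2, "C": 2, "D": 4, "E": 12, "F": 2, "G": 3, "H": 2,
    "I": 9, "J": 1, "K": 1, "L": 4, "M": 2, "N": 6, "O": 8, "P": 2,
    "Q": 1, "R": 6, "S": 4, "T": 6, "U": 4, "V": 2, "W": 2, "X": 1,
    "Y": 2, "Z": 1,
}
TOTAL_TILES = sum(TILE_COUNTS.values()) + 2  # +2 blanks -> 100 tiles

def word_probability(word: str) -> float:
    """Multiply together each letter's relative frequency in the bag."""
    p = 1.0
    for letter in word.upper():
        p *= TILE_COUNTS[letter] / TOTAL_TILES
    return p
```

For example, a Q-heavy word like QI comes out far less probable than a word like RE built from common tiles, which is exactly the effect plotted above.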
# KDE plots of actual vs fitted frequencies, plus residual plots, for the two
# MLRMs (all words on the left; words with frequency > 20 on the right).
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(15, 15))
sns.kdeplot(y_test, color='r', label='Actual Value', ax=ax[0,0])
sns.kdeplot(yhat, color='b', label='Fitted Value', ax=ax[0,0])
ax[0,0].set_xlim(right=100)
ax[0,0].set_title("Actual and fitted values of word usage frequency from MLRM")
ax[0,0].legend()
sns.scatterplot(x=yhat, y=residuals, ax=ax[1,0])
ax[1,0].plot([yhat.min(),yhat.max()], [0,0], color="r")
ax[1,0].set_title("Residuals of word usage frequency from MLRM")
ax[1,0].set_ylim((-200, 1000))
ax[1,0].set_xlabel("Predicted values")
sns.kdeplot(y_test2, color='r', label='Actual Value', ax=ax[0,1])
sns.kdeplot(yhat2, color='b', label='Fitted Value', ax=ax[0,1])
ax[0,1].set_xlim(right=500)
ax[0,1].set_title("Actual and fitted values of word usage frequency(min. 20) from MLRM")
ax[0,1].legend()
sns.scatterplot(x=yhat2, y=residuals2, ax=ax[1,1])
ax[1,1].plot([yhat2.min(),yhat2.max()], [0,0], color="r")
ax[1,1].set_title("Residuals of word usage frequency(min. 20) from MLRM")
ax[1,1].set_ylim((-200, 1000))
ax[1,1].set_xlabel("Predicted values")
plt.show()
The above plots show multiple linear regression models (MLRMs) of the frequency of words played in the dataset, based on length, probability, value and count, with and without the words with frequency <= 20.
On the left, when all words are used, the fitted values undershoot the peak of the actual values by a lot, and the fitted distribution is somewhat erratic.
On the right, when only words with frequency > 20 are used, the fitted values match the actual values much better, only slightly overshooting the peak of the actual KDE and sitting slightly to the right.
Comparing the residual plots (drawn on the same axes), both have a fan shape, meaning the higher the word usage frequency, the less accurate the predictions. However, the residuals in the left plot stray further from the y=0 line, reaching around 300, and are not very symmetrical about it. On the right, the residual plot is much more symmetrical about y=0, and the residuals only reach about ±200.
Hence, the multiple linear regression model fit the data better once the words with frequency <= 20 were removed. This is likely because the dataset is small: many words have tiny frequencies simply because one player happened to play them once and nobody else did. These words form a dense horizontal band at the base of the data, which hurts the accuracy of any model.
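A minimal sketch of this kind of model on synthetic data follows. The feature construction here is invented for illustration (the notebook's real predictors are the word's length, probability, value and count), but the fit/predict/residual workflow is the same:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the real features (length, ln(probability), value).
rng = np.random.default_rng(0)
n = 500
length = rng.integers(2, 9, n)
log_prob = rng.normal(-12, 3, n)
value = rng.integers(2, 30, n)
X = np.column_stack([length, log_prob, value])

# Frequency loosely increases with probability, plus noise, mimicking the
# weak positive correlation seen in the scatter plots.
y = 150 + 8 * log_prob + 2 * length + rng.normal(0, 10, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)
yhat = model.predict(X_test)
residuals = y_test - yhat  # plotted against yhat to check for fan shapes
print(f"held-out R^2 = {model.score(X_test, y_test):.3f}")
```

Plotting `residuals` against `yhat`, as done above, is what reveals the fan shape when prediction error grows with frequency.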
# Filter plays containing specific tiles (lowercase letters denote blanks).
blanks = a[a["Play"].str.islower()]
Qs = a[a["Play"] == "Q"]
Zs = a[a["Play"] == "Z"]
Js = a[a["Play"] == "J"]
Xs = a[a["Play"] == "X"]
boards = []
alp = list("ABCDEFGHIJKLMNO")
num = list(np.arange(1, 16).astype(str))
for i in tqdm([a, blanks, Qs, Zs, Js, Xs]):
    board = pd.DataFrame(columns=alp, index=num)
    # Count plays per board coordinate; the count column is named "coord"
    # here, after the original series.
    coord_counts = pd.DataFrame(i["coord"].value_counts())
    coord_counts["coord1"] = coord_counts.index.str.extract("([A-Z])").values
    coord_counts["coord2"] = coord_counts.index.str.extract(r"(\d+)").values

    def add_coord(row):
        board.loc[row["coord2"], row["coord1"]] = row["coord"]

    coord_counts.apply(add_coord, axis=1)
    board = board.fillna(0).astype(int)  # squares never used get count 0
    board = board / board.max()          # normalise counts for the colour scale
    boards.append(board)
subplots = make_subplots(rows=3, cols=2, subplot_titles=["All", "Blank", "Q", "Z", "J", "X"])
for i in range(6):
    row = i // 2 + 1
    col = i % 2 + 1
    fig = px.imshow(boards[i], color_continuous_scale="magma")
    fig.update_layout(width=700, height=700)
    fig.update_traces(text=board_layout, texttemplate="%{text}")
    for trace in fig["data"]:
        subplots.append_trace(trace, row=row, col=col)
subplots.update_layout(
    width=900, height=1500,
    title="Heat map of various tiles' frequencies across board",
    title_y=0.99,
    updatemenus=[
        dict(
            type="buttons",
            direction="left",
            buttons=[
                dict(label="Show mults", method="update", args=[{"text": [board_layout]}]),
                dict(label="Hide mults", method="update", args=[{"text": [np.full((15, 15), "")]}]),
            ],
            pad={"r": 10, "t": 20},
            showactive=True,
            x=0.11,
            xanchor="left",
            y=1.1,
            yanchor="top",
        )
    ],
)
subplots.show()
The square with the highest frequency of letters is the central square, which is because every game must start with a word through that square.
Looking at the squares adjacent to the central square, players clearly tend to favour playing their first word horizontally rather than vertically.
Looking at the edge TWs (A8, O8, H1, H15), the East and South ones (O8, H15) seem to be favoured over the other two.
Looking at the corner TWs, the North-West TW (A1) is used far less than the other three.
This disproportionate edge and corner TW usage leads me to believe that players find it easier, and more common, to play words that hook at the front and end on the multiplier, rather than words that hook at the back, with the start of the word landing on the multiplier. (E.g. any word using the North-West TW must start on that square.)
This fits with the S hook: holding an S gives you a one-letter lenience at the end of your word, since you can often choose whether or not to append it, whereas front hooks (letters that go BEFORE a word) are much scarcer.
Generally, tiles cluster around the multipliers (DLs and TWs along the middle row and column, DWs along the diagonals).
Here, we see quite the opposite: the blank rarely lands on the multipliers, particularly the DLs and TLs.
This is because the blank is worth 0 points, so placing it on a DL or TL wastes the multiplier entirely.
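To make that waste concrete, here is a toy scorer; the letter values are real Scrabble values, but the function and the example play are illustrative, not from the dataset:

```python
# Toy scorer illustrating why a 0-point blank wastes a letter multiplier
# but still benefits fully from a word multiplier. "_" denotes the blank.
VALUES = {"Q": 10, "U": 1, "I": 1, "Z": 10, "_": 0}

def score(word, letter_mult=None, word_mult=1):
    """letter_mult maps tile position -> DL/TL factor for that square."""
    letter_mult = letter_mult or {}
    total = sum(VALUES[c] * letter_mult.get(i, 1) for i, c in enumerate(word))
    return total * word_mult

# Blank standing in for Q in QUIZ: a TL under the blank adds nothing.
print(score("_UIZ", letter_mult={0: 3}))  # 0*3 + 1 + 1 + 10 = 12
# A TW, by contrast, still triples the whole word.
print(score("_UIZ", word_mult=3))         # 12 * 3 = 36
```

With a real Q on the TL instead, the same play would score 42, which is why the letter multipliers attract high-value tiles but not blanks.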
We again notice similar disparities between the TWs of the board, though this time the blank lands far more often on the south TW (H15) than on the west TW (A8). While the imbalance is less pronounced for the other two TWs, I believe this too can be partly explained by S hooks making back hooks much more prevalent than front hooks.
Q, Z, J and X are the highest scoring tiles, with Q and Z at 10 points, and J and X at 8 points.
All four of these tiles are clearly denser along the multipliers, even showing some extra density along the DW diagonals.
They are all particularly dense at the TLs, as players usually want to place these high-value tiles on letter multipliers for the large bonus. X and Z are especially dense at the TLs compared to the other two, likely because they are the most versatile in terms of which words contain them and whether words can start or end with them (many words start with Z or end with Z/X, while few end with Q/J).
Q and J are also noticeably less dense in the top-right triangle of the board, especially at the corner and edge TWs (H15, O15, O8), while conversely X is less dense in the bottom-left triangle. I am not sure why this happens, but I suspect it has to do with few words starting with X, and few words ending with Q or J.
For Scrabble players, the recommendation is to dump tiles like U, V and W as fast as possible, as they have been shown to be quite weak, while S and the blank (along with R, T and E) are worth holding on to.
It would be nice to examine the relationship between word frequency and probability with a proper measurement of probability, and with a larger dataset. More work could also be done on the true value of tiles, factoring in how long players hold on to them. I also wish I had had time to analyse how competitive Scrabble has evolved over the years.